Data Management and Preservation Planning for Big Science

Authors

  • Juan Bicarregui
  • Norman Gray
  • Rob Henderson
  • Roger Jones
  • Simon C. Lambert
  • Brian Matthews
Abstract

‘Big Science’, that is, science which involves large collaborations with dedicated facilities, large data volumes and multinational investments, is often seen as different when it comes to data management and preservation planning. Big Science handles its data differently from other disciplines and has data management problems that are qualitatively different from theirs. In part, these differences arise from the quantities of data involved, but possibly more importantly from the cultural, organisational and technical distinctiveness of these academic cultures. Consequently, the data management systems are typically and rationally bespoke, but this means that the planning for data management and preservation (DMP) must also be bespoke. These differences are such that ‘just read and implement the OAIS specification’ is reasonable DMP advice, but this bald prescription can and should be usefully supported by a methodological ‘toolkit’, including overviews, case studies and costing models, to provide guidance on developing best practice in DMP policy and infrastructure for these projects, as well as considering OAIS validation, audit and cost modelling. In this paper, we build on previous work with the LIGO collaboration to consider the role of DMP planning within these big science scenarios, and discuss how to apply current best practice. We discuss the result of the MaRDI-Gross project (Managing Research Data Infrastructures – Big Science), which has been developing a toolkit to provide guidelines on the application of best practice in DMP planning within big science projects. This is targeted primarily at projects’ engineering managers, but is also intended to help funders collaborate on DMP plans which satisfy the requirements imposed on them.

International Journal of Digital Curation (2013), 8(1), 29–41.
http://dx.doi.org/10.2218/ijdc.v8i1.247

The International Journal of Digital Curation is an international journal committed to scholarly excellence and dedicated to the advancement of digital curation across a wide range of sectors. The IJDC is published by UKOLN at the University of Bath and is a publication of the Digital Curation Centre. ISSN: 1746-8256. URL: http://www.ijdc.net/

Introduction: The Big-Science Paradigm

There appears to be a rough consensus that many of the central concerns regarding data management and preservation (DMP) for the majority of academic disciplines relate to the management of a large collection of disparate data generated in a relatively undisciplined manner by a wide variety of independent researchers, who are mainly concerned with their own research. Thus the concerns are about the usability of repositories, the challenges of persuading researchers to deposit their data, and how best to manage the citation of data, with additional concerns about how researchers may best receive credit for the data they have collected.

However, implicit within this view appears to be a rather simple conceptual model of what it is that researcher-users actually do to create the data. Researchers: (i) obtain grants, which (ii) they use to generate data within their own research methods, which (iii) they manage locally and then (iv) share, either as datasets or linked to publications. Much of the interest in research information systems within this area presumes a rather simple relationship between (i), (ii) and (iv), and much of the DMP effort appears to be concerned with persuading researchers to do step (iii) better, possibly with suitable institutional assistance, cajoling or prescription.
This model is less appropriate for large-scale projects in the physical sciences, which have decades of experience with data management and sharing, at scale, incorporating a data management workflow that is different from this paradigmatic one under each of its four headings. This incompleteness suggests firstly that the DMP solutions created under this paradigm, when applied to other disciplines, may not be as generally applicable as expected; and secondly that there are data management problems outside those automatically considered by that paradigm, which are nonetheless well-understood, and for which practical solutions already exist.

For our purposes, such ‘Big Science’ projects tend to share many features which distinguish them from other research disciplines. These include the following:

  1. The projects are large collaborations, involving hundreds or thousands of researchers from many institutions, typically in different countries.
  2. The projects last many years, with extended planning and set-up phases, and long lifetimes of experimental running, data collection and analysis.
  3. The projects are funded with long-term budgets, and typically from multiple sources, thus requiring complex legal agreements on resource provision and ownership.
  4. The projects typically establish dedicated experimental facilities, with their own structures and dedicated technical staff, including computing support.
  5. The projects typically generate large volumes of complicated and instrument-specific data (1–10 PB per year, with exabyte-per-year rates anticipated in the next decade).

The key feature, from the point of view of this paper, is that this is facilities science.
There is a core facility, with multinational funders, a multi-decade existence, and a conceptual and administrative separation between the elaborately-engineered resource and the research scientists. Particle physics has the longest experience with this model of doing science, most famously in the Large Hadron Collider (LHC) collaboration centred at CERN (the European Organization for Nuclear Research, www.cern.ch), but gravitational wave physics (e.g. LIGO, the Laser Interferometer Gravitational Wave Observatory, www.ligo.caltech.edu) and radio astronomy (e.g. SKA, the Square Kilometre Array, http://www.skatelescope.org/) have or will have similar or larger collaboration sizes and data volumes. Other areas of astronomy have long experience with internationally shared telescope facilities, though working at a different scale.

Structural sciences (i.e. studies into the micro- and nano-scale structure of matter) are moving towards this model of working, where large-scale facilities, such as neutron and synchrotron sources, support many individual scientists working within the traditional DMP paradigm. However, the facility itself, with its continuity of funding over a long period, dedicated infrastructure and specialist staff, has the characteristics of “big science” and can leverage those characteristics to provide more systematic data management for its user community (Flannery et al., 2009).

Preservation policy and practice in big science deals with large volumes of data in large collaborations (hundreds to thousands of researchers), with technically sophisticated users and computing support. The data volume is the least significant feature in the present context, since it is ‘only’ a technical problem; the other two features change the game.

This scale of working produces some simplifications:

  • It is well resourced: DMP is not the responsibility of quarter-time junior researchers, but a key concern of the project’s engineering management.
  • There is a collaborative ethos, which has data sharing at its core. Data, once acquired, goes directly into the archive and is retrieved from there for processing by researchers.

However, the scale also produces a variety of complications:

  • There are multiple funders in multiple countries with various, sometimes conflicting, requirements relating to data management and dissemination.
  • The multiplicity of funders often means that no one can dictate terms.
  • Experiments and their datasets are governed by networks of Memoranda of Understanding and Service Level Agreements and in-collaboration decision-making processes which, however intricate the process, are fundamentally consensus-based.
  • The intellectual property of the data is often complex.

Thus the nature of big science determines that it cannot benefit from the considerable effort going into providing technical and software support for DMP planning. It is in this context that the advice of a recent JISC-funded study of data at this scale to “just read and implement OAIS” (Gray et al., 2012) is more practical than it appears. Facilities-scale science projects have the financial and engineering resources, and technical expertise, to produce bespoke DMP plans and data management systems. However, what must be avoided is pointless reinvention, and so there is an outstanding need for a fast-track to an optimal solution. This is where funder support can be helpful in supporting the relevant technical personnel by connecting them to high-level DMP best practice.

The MaRDI-Gross project is building on previous work by developing practical advice for large-scale DMP planning. It is based on the insights of the OAIS reference model, and includes discussions on cost modelling, with a target audience of big science practitioners and funders.
The emphasis has largely been on the UK community associated with the Science and Technology Facilities Council (STFC), the major funder of big science in the UK. The guidelines nonetheless apply more widely, since much of STFC's work is carried out in collaboration with similar bodies in other countries and with cross-national institutions. The goal of the project is to bring big science practitioners up to speed with current best practice, and to equip funders with the means to engage critically with DMP planners, giving both groups a rapid boost towards relevant disciplinary best practice. In the rest of this paper we examine the factors covered in these guidelines, and discuss the implications relevant to the wider DMP community. The full guidelines can be found in Bicarregui et al. (2012).

Similar resources

Managing Research Data in Big Science

The project which led to this report was funded by JISC in 2010–2011 as part of its ‘Managing Research Data’ programme, to examine the way in which Big Science data is managed, and produce any recommendations which may be appropriate. Big science data is different: it comes in large volumes, and it is shared and exploited in ways which may differ from other disciplines. This project has explore...

Big Data Analytics and Now-casting: A Comprehensive Model for Eventuality of Forecasting and Predictive Policies of Policy-making Institutions

The ability of now-casting and eventuality is the most crucial and vital achievement of big data analytics in the area of policy-making. To recognize the trends and to render a real image of the current condition and alarming immediate indicators, the significance and the specific positions of big data in policy-making are undeniable. Moreover, the requirement for policy-making institutions to ...

Privacy and Security of Big Data in THE Cloud

Big data has been arising a growing interest in both scientific and industrial fields for its potential value. However, before employing big data technology into massive applications, a basic but also principle topic should be investigated: security and privacy. One of the biggest concerns of big data is privacy. However, the study on big data privacy is still at a very early stage. Many or...

P-V-L Deep: A Big Data Analytics Solution for Now-casting in Monetary Policy

The development of new technologies has confronted the entire domain of science and industry with issues of big data's scalability as well as its integration with the purpose of forecasting analytics in its life cycle. In predictive analytics, the forecast of near-future and recent past - or in other words, the now-casting - is the continuous study of real-time events and constantly updated whe...

Journal title:
  • IJDC

Volume 8, Issue 1

Pages 29–41

Publication date: 2013